Pan-genome de Bruijn graph using the bidirectional FM-index

Background Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index’ backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. Results We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph’s topology through visualization and sequence alignment. Conclusions We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05531-6.

1 Supplementary Methods

Bidirectional FM-index
In this section, we provide a brief overview of the bidirectional FM-index, which forms the base of our pan-genome graph representation.
.n[ of a text T of length n is defined as BWT Here, SA denotes the suffix array, an array over integer values that indicate the starting positions of the suffixes of T in lexicographic order [2].We build the suffix array using divsufsort (coded by Yuta Mori).There is an important relationship between the characters in the sorted permutation of T (called F ) and the BWT of T (called L), namely the LF property.The LF property states that the ith occurrence of a particular character c in the BWT and the ith occurrence of c in F correspond to the same character in T .We thus need support for Occ(c, i) queries on the BWT that return the number of occurrences of character c in the prefix BWT[0..i[.We realize this using |Σ| bit vectors with constant-time rank support (rank9 algorithm [3]).The LF property can then be computed in constant time as follows: LF[i] = C(c)+Occ(c, i), with c = BWT [i].Here, C(c) denotes the number of characters in T strictly smaller than c.These values are pre-computed and stored in a small array of size |Σ|.Collectively, the BWT, SA, bit vectors and C array are referred to as the (unidirectional) FM-index [4].Often, only the suffix array entries for every s SA th suffix are stored, s SA being the suffix array sparseness factor.This sparse representation of the suffix array requires an auxiliary bit vector with rank support to indicate the presence or absence of index positions and to compute their offsets within the sparse suffix array.Every entry of the suffix array can be computed in O(s SA ) time using the LF property.This is one of the various design choices through the which time-space tradeoff can be controlled.
Exact pattern matching using the FM-index is performed by matching character by character, from right to left.Let  A bidirectional FM-index is obtained by also storing BWT r , the Burrows-Wheeler transform of T r , the reverse of T .By keeping track of both the range [b, e[ over BWT as well as the range [b r , e r [ over BWT r in a synchronized manner, one can extend a pattern P to either cP (extendBackward) or P c (extendForward) in O(|Σ|) time [5].
Table S1 Overview of all components of our bidirectional FM-index, with their respective memory usage.For each component, we clarify the number of entries it contains, and the number of bits needed to store each entry.The number of entries and memory usage of each component is illustrated for the pan-genome of 10 human genomes (s SA = 16).By replacing the Occ data structure with a PrefixOcc data structure, this bound is improved to O(1) [6].Note that both ranges always have the same width, which monotonically decreases when new characters are added to the occurrence (adding a character cannot lead to more occurrences in the reference text).Table S1 details the components of the bidirectional FM-index that is used in our implementation.The compacted reference text uses 3 bits to distinguish 6 characters.Note that the original reference text T (8 bits per character), is necessary to build the representation.Once this process is finished, T can be removed from disk to spare memory.The counts array stores counts for each character in the extended ASCII alphabet, which is why 256 entries are stored.In practice, only 6 of these entries can have non-zero values ('A', 'C', 'G', 'T', '%' and '$').BWT uses 3 bits per character, analogous to the compacted reference text.PrefixOcc, resp.PrefixOcc r , stores the locations of characters 'A', 'C', 'G', 'T', '%' in BWT, resp.BWT r , in 5 bit vectors (accounting for 5 bits per character).Additionally, each of these bit vectors must support constant-time rank operations, which leads to 0.25 bits overhead per character per bit vector (rank9 algorithm [3]).This totals to 6.25 bits per character for tables PrefixOcc and PrefixOcc r .Note that we only need PrefixOcc r , not BWT r itself.To guarantee O(s SA ) suffix array indexing, we store each s SA th suffix in the sparse SA.To check whether a certain index in the SA corresponds to a s SA th suffix or not, we need an additional bit vector indicating the stored entries.This bit vector also needs 0.25 additional bits per entry for constant-time rank support.

Component
Finally, the longest common prefix array or the LCP array is also used for the construction of our pan-genome graph.The LCP array of string T , denoted by LCP, is an array of size n + 1 such that LCP ) for 0 < i < n, where lcp(u, v) denotes the length of the longest common prefix between two strings u and v [7].
2 Supplementary Results  The 95% confidence interval for the runtime is also indicated.Blue: the memory usage of the complete representation of the ccdBG for the pan-genome of 10 human genomes (including the underlying bidirectional FM-index), as a function of the suffix array sparseness factor s SA .
structure.Fig. S1 illustrates the time-space tradeoff that arises when s SA is altered.It shows the runtime of the APM procedure for matching 100 000 Illumina reads to the pan-genome graph of 10 human genomes as well as its memory usage, as a function of s SA .
For s SA = 1, the suffix array is stored at full capacity and the total memory usage is 307.39GiB.When s SA is increased, the memory usage of SA decreases as less of its entries are explicitly stored.For an infinitely large value of s SA , the data structure comprises 81.33 GiB as the memory usage of SA approaches 0 GiB.The performance of the APM procedure decreases as s SA increases, since the SA entries that are not explicitly stored must be recalculated on the fly.We see that the runtime increases exponentially with s SA .Note however that for very small values of s SA (1, 2, 4, 8), the APM performance is barely affected.In fact, the APM performance is better at s SA = 2 than at s SA = 1, presumably due to the high RAM requirements.
Determining the right value for s SA is different in every scenario: large pangenomes require a higher suffix array sparseness factor, whilst smaller datasets benefit from a more complete suffix array.Still, choosing s SA = 16 or s SA = 32 generally leads to a good balance.results because we are only interested in end-to-end alignments in the context of this paper.The number of alignments per read was cut off at 10 for Giraffe to reduce runtime, and because 10 is the expected number of alignments for most reads since 10 genomes are contained in the pan-genome.For this reason, no number of text occurrences higher than 10 is reported in the histograms.Fig. S3 shows the distribution of the number of text occurrences reported per genome by Giraffe.

Comparison of Nexus with BWA-MEM
Tab. S2 contains the same data as visualized in the right scatterplot of Fig. 3 in the paper.To illustrate that Nexus effectively reports a correct minimal edit distance, we investigate the occurrences corresponding to the first row of this table in Fig. S4.

Comparison of Nexus with Bowtie 2
Tab. S3 provides additional alignment results corresponding to Bowtie 2 for different values of the -k parameter (i.e., the maximum number of alignments that can be reported per read).The first row corresponds to the Bowtie 2 entry in Tab. 10 in the paper.Additionally, Fig. S5 provides an in-depth analysis of Nexus versus Bowtie 2 for each of value of the -k parameter, analogous to the analysis conducted for BWA-MEM in the paper.In general, the same conclusions can be drawn for Bowtie 2 as for BWA-MEM.Note that for higher values of the -k parameter, Bowtie 2 detects more occurrences with an edit distance lower than 5 that are not found by Nexus.The explanation for this observation remains the same as for BWA-MEM.

Cytoscape Visualization
In Fig. 5 in the manuscript, we illustrate the visualization of a subgraph of the pan-genome of 341 M. tuberculosis strains in a curated form: • the first k − 1 overlapping characters were omitted from each node; • node identifiers were replaced by a smaller set of character identifiers; • irrelevant nodes were purged from the graph; • the multiplicity of the edges was added to the figure explicitly.
In this section, we include the original subgraph as it is visualized by Cytoscape in Figure S6, which was created using the following command: ) for mapping 100 000 Illumina reads of length 101 bp as well as their reverse complement to the pan-genome graph of 10 human genomes.Left: upset plot that shows the average number of occurrences per read reported by both tools, or exclusively by one tool.Middle: distribution of the reported number of occurrences (on logarithmic scale) in function of the corresponding edit distance.We distinguish occurrences reported by both tools (plotted in function of the edit distance reported by Nexus), and those reported uniquely by only one of the tools.Right: scatter plot visualizing the common occurrences where the edit distance reported by Nexus does not match that of Bowtie 2.  The inputted DNA sequence corresponds to the last three codons from the RRDR region ("TCGGCGCTG"), padded with 18 nucleotides from the reference strain on either side, in order to visualize all nodes that contain a part of these three codons.The dark gray nodes in Fig. S6 correspond to the reference path in the manuscript's Fig. 5 (i.e., node path ADEFGHIK).Furthermore, nodes 111664, 93978 and 73107 in Fig. S6 correspond to respectively nodes B, C and J in Fig. 5 (manuscript), representing RRDR mutations S450L, S450W and L452P.

BWA construction
$ ./bwaindex -p bwa/10HumanGenomes 10HumanGenomes.faA placeholder fasta header line needed to be added to each genome, since BWA cannot handle sequences longer than 2 31 characters.

Giraffe construction
$ ./vgautoindex --workflow giraffe --prefix giraffe/10 → HumanGenomes --ref-fasta 10HumanGenomes.fa--tmp-dir → vgGiraffeKirlia/temp/ --verbosity 2 --threads 1 To be able to run this command, you must have the Giraffe version corresponding to the commit above (or higher).Otherwise, the graph cannot be built without the presence of a .VCF file.Additionally, a placeholder fasta header line needed to be added to each genome, since Giraffe cannot handle sequences longer than 2 31 characters.

Nexus construction
$ ./nexusBuild-s 16 -c 128 -p 10HumanGenomes 25 Where input file 10HumanGenomes.txtcontains a preprocessed version of corresponding 10HumanGenomes.fa.From each strain we removed all N's and substituted them with a random nucleotide.Then we removed the first line, concatenated the chromosomes, removed newline characters, and added a separation character to the end of file for each strain.Next, the strains can be concatenated into one pan-genome.An example of preprocessing a pan-genome of 3 strains is as follows: Where genome1.fasta.non,genome2.fasta.nonand genome3.fasta.nonare the result of the substitution of the N's for the input strains and pangenome.txtis the reference pan-genome used to build the index.genome1.txt,genome2.txtand genome3.txtare intermediate files which can be removed once the pan-genome is obtained.dollar and percent are text files containing a single character '$' and '%', respectively.These files should be present in your directory before executing the above commands.
Since submitting the first version of the manuscript, Nexus v1.1.1 has become available, which is now able to create the index starting from .fasta/.fa/.fnafile as well.Where reads.txt is a preprocessed text file containing only the DNA sequences contained in reads.fastq,no other information.Additionally, reads.txtalso contains the reverse complement of all reads, as A4 does not match the reverse complements automatically.

Nexus
$ ./nexus-e <ED> -s 16 -c 128 -ss custom ../search_schemes/kuch_k → +1_adapted/ 10HumanGenomes 25 reads.fastq With ED ∈ {0,1,2,3,4}.Parameters: • -e: maximum edit distance • -s: suffix array sparseness factor • -c: sparseness factor that indicates how many checkpoints must be stored to identify nodes • -ss: search scheme   Where the bwa/ directory contains the index.Parameters: • -a: output all alignments instead of one per read • -L 1000000,1000000: set a high penalty to avoid 5'-and 3'-end clipping • -Y: if clipping is happens anyway, report all reads as soft clipped, not hard clipped Even though we set a very high penalty to avoid clipping, a small fraction of occurrences still appears to be clipped.Since we are only interested in end-to-end alignment, we focused on non-clipped alignments for analyses such as in Fig. 3 in the paper.The non-clipped reads can be selected as follows: $ grep -Ev "[0-9]+S" bwa/output.sam> bwa/output.notClipped.sam

Bowtie 2
$ ./bowtie2--sensitive --end-to-end -k 10 --time --met-stderr -x → bowtie/10HumanGenomes -U reads.fastq> bowtie/output.sam Where the bowtie/ directory contains the index.Parameters: • --end-to-end: end-to-end alignment • --sensitive: choose the parameter preset that performs sensitive alignment (i.e., prioritizing sensitity over performance to some degree) • -k: maximum number of reported occurrences per read • --time and --met-stderr: parameters for benchmarking To generate Fig. S5, Bowtie 2 also run with -k 100 and -k 1000.Where the giraffe/ directory contains the index.Parameters: • -o: output format of the alignments • -t: number of mapping threads to use • -M: maximum number of reported occurrences per read Similar to BWA-MEM, Giraffe produces clipped alignments.Since we are only interested in end-to-end alignment, we focused on non-clipped alignments for analyses such as in Fig. S2.The non-clipped reads can be selected in the same way as in Section 3.5.2.

A4 and Nexus
See commands in Section 3.4.Additionally, an annotation file is provide that contains the identifiers of these strains in the order in which they appear in the concatenated reference file: https: //github.com/biointec/nexus/releases/download/v1.1.0/MTuberculosisPanGenome. annotation.txt.

Commands for
Using the reference text file, the index can be built as follows: $ ./nexusBuild-s 16 -c 128 -p MTuberculosisPanGenome 19

Reproducing the Case Study
To reproduce the case study, switch to the corresponding branch on GitHub: https: //github.com/biointec/nexus/tree/CaseStudy/casestudy.Compile the code.Run the following command inside the folder containing the index.

$ ./casestudy
This executable searches for compensatory mutations as is described in the manuscript.Comments are provided in the casestudy.cppscript to outline the process.The case study results in a file MTuberculosisPanGenome Compensatory.tsv,which contains the 14 candidate putative compensatory mutations that were also discussed in Table 12 in the manuscript.This file contains 5 fields: • The RRDR mutation to which the candidate putative compensatory mutation corresponds; • The identifier of the node in which the candidate putative compensatory mutation is found; • The number of strains that carry the candidate putative compensatory mutation; • The position of the candidate putative compensatory mutation with respect to the reference strain; • The length of the node containing the candidate putative compensatory mutation.Note again that the pipeline implemented here is more of an ad hoc solution to the problem of candidate compensatory mutations corresponding to mutations in the RRDR region of rpoB, rather than a general pipeline to be readily applied to other problems.For this reason, we provide the functionality in a .cppfile.If there were any interest to extend these ideas into a more general pipeline, possibly written in a more accessible script, do not hesitate to contact the authors of the manuscript.
[b, e[ denote the interval over the suffix array for which the corresponding suffixes have P as a prefix.The suffix array interval [b ′ , e ′ [= extendBackward([b, e[, c) whose suffixes have cP as a prefix can then be computed by b ′ = C(c) + Occ(c, b) and e ′ = C(c) + Occ(c, e).Because Occ(c, i) queries can be performed in constant time, exact matching of a pattern P of size m takes O(m) time.The size of the obtained interval [b, e[ denotes the number of occurrences of P in T .The positions of the occurrences in T can be obtained using the suffix array.

Figure
Figure S1Black: average runtime over 10 experiments for mapping 100 000 Illumina reads of length 101 bp as well as their reverse complement to the pan-genome graph of 10 human genomes, as a function of the suffix array sparseness factor s SA (1 to 256).We perform approximate pattern matching with a maximum allowed number of errors of K = 4, using the search scheme proposed by Kucherov et al..The pan-genome is built for k = 25 and scp = 128.The 95% confidence interval for the runtime is also indicated.Blue: the memory usage of the complete representation of the ccdBG for the pan-genome of 10 human genomes (including the underlying bidirectional FM-index), as a function of the suffix array sparseness factor s SA .
Fig. S2 (a) shows the comparison of Nexus' alignment results with those of PuffAligner.
Figure S2Analysis of the reported occurrences for different aligners for mapping 100 000 Illumina reads of length 101 bp as well as their reverse complement to the pan-genome graph of 10 human genomes.Left: upset plot that shows the number of reads that are aligned by both tools, by only one tool, or by none.Right: the distribution of reads aligned by only one of the two tools as a function of the corresponding number of reported text occurrences.The same histogram is shown in 4 forms to highlight different features of the distributions: on a linear and logarithmic axis for the number of text occurrences (left and right), and on a linear and logarithmic axis for the corresponding number of reads (top and bottom).

Figure S3
Figure S3The number of text occurrences reported per genome by Giraffe, for matching 100 000 Illumina reads (length 101 bp) and their reverse complement to the pan-genome graph of 10 human genomes.It can be observed that on average, approximately 8 occurrences are reported per read, and these occurrences are distributed uniformly over the 10 included genomes.

Figure S5
Figure S5Analysis of the read alignment results of Bowtie 2 (for different maximum numbers of alignments per read) and Nexus (K = 4) for mapping 100 000 Illumina reads of length 101 bp as well as their reverse complement to the pan-genome graph of 10 human genomes.Left: upset plot that shows the average number of occurrences per read reported by both tools, or exclusively by one tool.Middle: distribution of the reported number of occurrences (on logarithmic scale) in function of the corresponding edit distance.We distinguish occurrences reported by both tools (plotted in function of the edit distance reported by Nexus), and those reported uniquely by only one of the tools.Right: scatter plot visualizing the common occurrences where the edit distance reported by Nexus does not match that of Bowtie 2.

Figure S6
Figure S6Cytoscape visualization of a subgraph of the pan-genome ccdBG of 341 M. tuberculosis strains (k = 19), corresponding to the end of the RRDR region of gene rpoB.Dark nodes represent the node path corresponding to the sequence for which visualization was requested.

3. 5 . 4
Giraffe $ ./vggiraffe -t 1 -Z giraffe/10HumanGenomes.giraffe.gbz-m → giraffe/10HumanGenomes.min -d giraffe/10HumanGenomes.dist -f → reads.fastq-o SAM -M 10 2.1 The Effect of the Suffix Array Sparseness Factor on Memory Usage and APMPerformance Adjusting the suffix array sparseness factor can lead to significantly less memory usage, as the complete suffix array is by far the largest component of the data

Table S2
Overview of the number of alignments that are found by Nexus as well as BWA-MEM, but at a different edit distance.This table corresponds to the scatter plot in the right panel of Fig.3in the paper.
TableS3Extension of Tab. 10 in the paper for Bowtie 2 with different values for the -k parameter (the maximum number of alignments reported per read).The Bowtie 2 entry in Tab. 10 corresponds to Bowtie2 -k 10.